MFEAFN: Multi-scale feature enhanced adaptive fusion network for image semantic segmentation

Low-level features contain spatial detail information, and high-level features contain rich semantic information. Semantic segmentation research focuses on fully acquiring and effectively fusing spatial detail with semantic information. This paper proposes a multiscale feature-enhanced adaptive fusion network named MFEAFN to improve semantic segmentation performance. First, we designed a Double Spatial Pyramid Module named DSPM to extract more high-level semantic information. Second, we designed a Focusing Selective Fusion Module named FSFM to fuse feature maps of different scales and levels. Specifically, FSFM generates attention weights through a spatial attention mechanism and a two-dimensional discrete cosine transform, and uses them to enhance the feature maps and fuse them adaptively. To validate the effectiveness of FSFM, we designed different fusion modules for comparison and ablation experiments. MFEAFN achieved 82.64% and 78.46% mIoU on the PASCAL VOC2012 and Cityscapes datasets, respectively. In addition, our method has better segmentation results than state-of-the-art methods.


Introduction
Semantic image segmentation aims to assign a semantic label to each pixel in an image and is a challenging task in computer vision. Fully convolutional networks (FCNs) [1] were the pioneering work in deep-learning-based segmentation. Their essential innovation is replacing fully connected layers with convolutional layers, which enables end-to-end semantic segmentation by grouping the pixels in an image that share the same semantic meaning. However, FCN classifies each pixel without considering how pixels relate to each other.
To obtain richer context information and improve the accuracy of segmentation results, researchers have proposed methods for aggregating context, such as the pyramid pooling module (PPM) [2] and the atrous spatial pyramid pooling module (ASPP) [3]. PPM [2] aggregates multiscale contextual information to obtain global context. ASPP [3] uses dilated convolution to enlarge the receptive field without adding parameters. However, an image may contain objects of the same class at very different sizes, and a single pyramid pooling module cannot capture very large or very small targets well. Therefore, we designed a double-branch feature extraction module, the Double Spatial Pyramid Module (DSPM), which consists of two parallel Spatial Pyramid Modules (SPM1 and SPM2) with different atrous rates. SPM1 captures small objects in the image using dilated convolutions with lower atrous rates, and SPM2 captures large objects using dilated convolutions with larger atrous rates.
Feature fusion is commonly performed by summation or concatenation, but this approach is not effective because features at different levels or scales contain different semantic information. Lower-level features contain more location and detail information, while higher-level features contain rich semantic information. Simply summing or concatenating them treats the different information in each feature map equally, which can cause spatial and semantic information to interfere with each other.
To fuse features at different levels or scales more effectively, researchers have borrowed the idea of the attention mechanism when designing network models. BiSeNetV1 [4] proposed a feature fusion module (FFM) that concatenates the output features of the spatial path and the context path, generates channel weight vectors through a global pooling operation, and then reweights and fuses the features through two fully connected layers. LwMLA-NET [5] proposes the Multi-Level Attention (MLA) module, which uses spatial attention, channel attention, and pixel attention to extract relevant contextual information from different levels of abstraction, significantly reducing computational cost by avoiding the propagation of unimportant features to the decoder. However, the attention blocks in MLA are connected serially and do not enhance the adaptive fusion of these features as our proposed FSFM does. Furthermore, MLA uses pooling, whereas FSFM uses a two-dimensional discrete cosine transform (2DDCT) to reduce the information loss caused by pooling operations. SKNet [6] is an attention mechanism based on convolution kernels: the feature maps obtained by splitting the input feature maps are reweighted for adaptive feature fusion. Woo et al. [7] proposed the attention module CBAM, which combines channel and spatial attention to generate attention maps along the channel and spatial dimensions for adaptive feature refinement.
The above three methods apply the global average pooling (GAP) operation to obtain global information. However, FcaNet [8] demonstrated that it is difficult to capture complex information about the input features by using GAP alone. Inspired by signal processing, FcaNet used the two-dimensional discrete cosine transform (2DDCT) [9] to transform the input to the frequency domain and obtain more frequency components, of which GAP is one. We designed the Focusing Selective Fusion Module (FSFM) based on these observations. Specifically, the feature map input to FSFM is first passed through a spatial attention mask to obtain a weight map over the spatial dimensions; then, the feature map is transformed into the frequency domain by the 2DDCT to obtain more frequency components for enhancing the adaptive feature fusion. Based on the two essential modules, FSFM and DSPM, we propose a multiscale feature-enhanced adaptive fusion network named MFEAFN to improve semantic segmentation performance. In summary, our contributions are:
• We design a Focusing Selective Fusion Module (FSFM), which can enhance the adaptive fusion of features by generating spatial and frequency correlation weight mappings for each feature map. FSFM is used not only to fuse features at different levels but also to fuse features with contextual and global information.
• A Double Spatial Pyramid Module (DSPM) was designed to extract objects of different sizes from the same category more efficiently.

Spatial pyramid pooling
Spatial pyramid pooling (SPP) [21] uses different window sizes and strides for different output scales to ensure that the output sizes are the same, and can fuse multiscale features to capture rich contextual information. ICNet [22] divides the images into high-, medium-, and low-resolution branches. The low-resolution images first pass through the semantic segmentation network to generate coarse segmentation results; afterward, the cascade feature fusion unit and the cascade label guidance strategy integrate the medium- and high-resolution features to progressively refine the coarse results. PSPNet [2] improves the ability to obtain global information by aggregating contextual information from different regions through the pyramid pooling module. DeepLabv2 [23], DeepLabv3 [3], and DeepLabv3+ [18] apply several parallel atrous convolutions with different rates (called Atrous Spatial Pyramid Pooling, or ASPP) to capture rich contextual information. DSNet [24] proposes a Context-Guided Dynamic Sampling (CGDS) module that adaptively samples spatially useful segmentation information by obtaining an efficient representation of rich shape and scale information. APCNet [25] proposes the Adaptive Context Module (ACM), which uses Global-guided Local Affinity (GLA) to compute the context vector for each local location and aggregate contextual information. SpineNet-Seg [26] is a scale-permuted network for semantic segmentation discovered by NAS. CFPNet [27] proposes the Channel-wise Feature Pyramid (CFP), a module that jointly extracts feature maps of various sizes and reduces the number of parameters. FPANet [28] designed a lightweight Feature Pyramid Fusion Module (FPFM) to fuse two different levels of features.

Attention module
The encoder in a basic encoder-decoder compresses the information of the whole sequence into a fixed-length vector, causing information loss. The attention mechanism was subsequently introduced in computer vision for object detection [29] and semantic segmentation [30]. Soft attention models can be divided into three attention domains: the channel domain, the spatial domain, and the mixed domain.
In the channel domain, the signal on each channel is given a weight that signifies the channel's relevance to the key information. Typically, a channel mask is created first and the importance of each channel is then computed; SENet [31] is representative of this approach. SENet first applies global pooling to each channel to obtain a scalar, a step called Squeeze. Then, each original channel is multiplied by the corresponding channel weight to obtain a new feature map. SKNet [6] is an enhanced version of SENet that can adaptively adjust its receptive field by applying different weights to its convolution kernels for different images.
In the spatial domain, key information is extracted by spatially transforming the spatial-domain information in images. Usually, a spatial mask of the same size as the feature map is formed and the importance of each position is then computed; the Spatial Attention Module is representative of this approach. Spatial-domain attention ignores the information in the channel domain and treats the features in each channel equally, which restricts spatial-domain transformation methods to the feature extraction stage of the original image and limits their use in other layers of the network. Channel-domain attention, in turn, applies global average pooling to the information in each channel and ignores the local information within the channel. Mixed-domain attention models were therefore designed to compute channel and spatial attention together; BAM [32] and CBAM [7] are representative.
DANet [30] introduces position and channel attention to resolve the differences between pixels of the same category that arise during convolution. EMANet [33] then uses the Expectation-Maximization (EM) algorithm to learn attention features, addressing problems such as the high computational cost of DANet.

Discrete cosine transform
The discrete cosine transform (DCT) is a standard tool in signal processing. In recent years, with the development of deep learning, several applications of the DCT have emerged in computer vision. Ulicny et al. [34] used a CNN to classify DCT-encoded images. By feeding rearranged DCT coefficients to a CNN, Lo et al. [35] performed semantic segmentation on the DCT representation. FcaNet [8] approached channel attention from the frequency domain: the authors proved that global average pooling (GAP) is proportional to the lowest frequency component of the 2-D discrete cosine transform (2DDCT), and introduced more frequency components through the DCT to exploit the information more fully. Shen et al. [36] proposed a new mask representation that applies the DCT to encode a high-resolution binary grid mask as a compact vector.

Backbone network
At present, VGGNet [37], Inception [38], and ResNet [39] are popular convolutional neural networks. VGGNet [37] investigates the relationship between the depth of a convolutional neural network and its performance. It uses smaller convolution kernels to introduce more nonlinear transformations without affecting the input and output dimensions, which increases network expressivity and reduces computation. It also trains and predicts with a multi-scale method, which increases the amount of training data, helps prevent overfitting, and improves prediction accuracy. However, once VGGNet [37] reaches a certain depth, its performance saturates.
To overcome these difficulties, the Inception [38] and ResNet [39] networks were designed. The Inception network is built from GoogLeNet modules that apply convolution or pooling operations to the input in parallel and concatenate all of the outputs into a deep feature map, so that the network's performance can be increased further within the constraints of computing resources. ResNet channels input information directly to the output through shortcut connections rather than relying only on the nonlinear transformation of the input, which speeds up training and improves model accuracy and generalization. Although ResNet enables networks to exceed hundreds of layers, a very deep network may still suffer from issues such as vanishing and exploding gradients. ResNeXt [15] combines ResNet and Inception by stacking residual blocks with the same topology and increasing cardinality. The above networks improve performance by increasing width or depth, but they need to be tuned manually to reach a better level. To solve these problems, Google proposed EfficientNet [40] and EfficientNetv2 [41], which use the compound scaling method: a compound coefficient φ scales the network's width, depth, and resolution uniformly. The EfficientNet series networks not only have fewer parameters but also achieve higher accuracy. To address the slow training of the EfficientNet series, EfficientNetv2 introduces the Fused-MBConv structure, jointly applies training-aware NAS and scaling to optimize model accuracy, speed, and parameter size, and proposes a progressive learning method to reduce training time.
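As a rough illustration of compound scaling, the sketch below uses the base coefficients reported for the original EfficientNet (α = 1.2, β = 1.1, γ = 1.15), which scale depth, width, and resolution as α^φ, β^φ, and γ^φ. The function name `compound_scale` is hypothetical, these values are not taken from this paper, and EfficientNetv2 does not necessarily follow them.

```python
# Base coefficients from the original EfficientNet grid search (assumed here
# for illustration only; they are not specified in this paper).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi: float) -> dict:
    """Return depth, width, and resolution multipliers for coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


# FLOPs grow roughly with depth * width^2 * resolution^2, so each unit of phi
# multiplies the cost by about ALPHA * BETA**2 * GAMMA**2, which is close to 2.
cost_per_unit_phi = ALPHA * BETA ** 2 * GAMMA ** 2

for phi in range(4):
    print(phi, compound_scale(phi))
```

A single coefficient therefore controls all three dimensions at once, instead of width, depth, and resolution being tuned by hand.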

Double spatial pyramid module
For semantic segmentation, multiscale context information and global context information are critical. Multiscale context information aggregates context at various scales to aid the segmentation of objects of different sizes in the same category. Global information aims to provide a comprehensive understanding of the entire scene by establishing global-range dependencies between pixels. Inspired by ASPP [23], we designed the Double Spatial Pyramid Module (DSPM) to obtain global and multiscale context information. The details of the DSPM are shown in Fig 2. The feature maps $F_{in} \in \mathbb{R}^{H \times W \times C}$ output by the backbone network are first fed into a two-branch structure composed of two Spatial Pyramid Modules (SPM) in parallel. In SPM1, one 3 × 3 depthwise separable convolution [15] and three 3 × 3 dilated convolutions with different atrous rates are applied to capture multiscale contextual information, and a global average pooling operation is used to capture global information. After that, we concatenate the resulting feature maps along the channel dimension to obtain the feature map $F_{concat} \in \mathbb{R}^{H \times W \times 5C}$, and then use a 1 × 1 convolution to reduce the channel dimension and exchange information between channels, yielding the output feature map $F_{out} \in \mathbb{R}^{H \times W \times C}$. SPM2 follows the same procedure. Objects of the same class but different sizes require different receptive fields, so we set the atrous rate $r$ differently in SPM1 and SPM2. In SPM1, the atrous rates are set to [4, 8, 12] to capture smaller objects in the image; in SPM2, the atrous rates are set to [12, 24, 36] to capture larger objects.
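To make the effect of the two rate sets concrete, the following sketch computes the spatial extent covered by a single dilated 3 × 3 kernel, using the standard relation k + (k − 1)(r − 1). The helper name `effective_kernel_size` is introduced here for illustration only; the rate lists are the ones stated above.

```python
def effective_kernel_size(kernel_size: int, rate: int) -> int:
    """Spatial extent (in pixels) covered by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


SPM1_RATES = [4, 8, 12]     # lower rates, aimed at smaller objects
SPM2_RATES = [12, 24, 36]   # larger rates, aimed at larger objects

spm1_extent = [effective_kernel_size(3, r) for r in SPM1_RATES]
spm2_extent = [effective_kernel_size(3, r) for r in SPM2_RATES]

print(spm1_extent)  # [9, 17, 25]
print(spm2_extent)  # [25, 49, 73]
```

On a feature map downsampled 16 times, these extents correspond to considerably larger regions of the input image, which is why the two branches respond to objects of different sizes.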

Focusing selective fusion module
To better fuse the output features of DSPM with features at different levels, we designed three feature fusion methods, as shown in Fig 4. Ablation experiments demonstrate that the Focusing Selective Fusion Module (FSFM) performs best, so we focus on the proposed FSFM here.
Most channel attention mechanisms use global average pooling (GAP) to obtain a global representation of each channel. However, this approach loses detailed information. FcaNet [8] demonstrated that GAP is proportional to the lowest frequency component of the 2DDCT. Therefore, we use the two-dimensional discrete cosine transform to obtain a multispectral vector. As the two examples of image DCT transformation in Fig 3 show, the image information is concentrated in the upper-left corner of the spectrum. Moreover, the channel attention mechanism does not exploit the relationship between different spatial locations, so we added a spatial attention mechanism to the designed channel attention mechanism to learn more representative features. In summary, the Focusing Selective Fusion Module (FSFM) uses the 2DDCT to obtain multiple frequency components for each feature, which then serve as a guide for adaptively assigning weights to feature maps that encode relationships between different spatial locations.
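The proportionality between GAP and the lowest frequency component can be checked numerically. The sketch below is a minimal NumPy illustration that assumes the standard orthonormal DCT-II; the helper names `dct_matrix` and `dct2` are hypothetical and are not part of the authors' implementation.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis; row u holds a(u) * cos((2i + 1) * u * pi / (2n))."""
    i = np.arange(n)
    u = np.arange(n)[:, None]
    basis = np.cos((2 * i + 1) * u * np.pi / (2 * n))
    scale = np.full((n, 1), np.sqrt(2.0 / n))
    scale[0, 0] = np.sqrt(1.0 / n)
    return scale * basis


def dct2(f: np.ndarray) -> np.ndarray:
    """2-D DCT of an (H, W) array, applied separably along rows and columns."""
    h, w = f.shape
    return dct_matrix(h) @ f @ dct_matrix(w).T


rng = np.random.default_rng(0)
H, W = 8, 12
f = rng.standard_normal((H, W))       # one channel of a feature map
spectrum = dct2(f)

gap = f.mean()                        # global average pooling of the channel
lowest = spectrum[0, 0]               # lowest frequency component D(0, 0)
ratio = lowest / gap                  # equals sqrt(H * W), independent of f

# The basis is orthonormal, so the transposed matrices invert the transform.
reconstructed = dct_matrix(H).T @ spectrum @ dct_matrix(W)
```

Under this normalization the ratio is a constant that depends only on the size of the feature map, so GAP carries exactly the information in D(0, 0), while every other coefficient is information that GAP discards.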
The spatial attention mechanism receives the feature maps F1 and F2 and uses the relationships between spatial locations to build two two-dimensional spatial weight maps, W1 and W2. For each input, we produce two feature descriptors using the average pooling and maximum pooling operations and concatenate them. A 7 × 7 convolution is then applied to the concatenated descriptors to generate the spatial attention map $W_s$. Feeding F1 and F2 into the spatial attention mechanism yields W1 and W2, respectively, which are multiplied with the corresponding spatial positions to yield the spatial-relation feature maps K1 and K2:

$$K1 = W1 \otimes F1, \qquad K2 = W2 \otimes F2$$

where $\otimes$ is element-wise multiplication.
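The spatial attention step described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: pooling is taken over the channel axis and a sigmoid follows the 7 × 7 convolution, both borrowed from CBAM [7] rather than stated in the text, and the random kernel stands in for learned weights. The function name `spatial_attention` is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def spatial_attention(feature: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Return an (H, W, 1) weight map for an (H, W, C) feature map.

    kernel has shape (7, 7, 2): one 7 x 7 filter over the two pooled descriptors.
    """
    avg_desc = feature.mean(axis=-1)                 # average pooling over channels
    max_desc = feature.max(axis=-1)                  # maximum pooling over channels
    desc = np.stack([avg_desc, max_desc], axis=-1)   # concatenated descriptors, (H, W, 2)

    pad = kernel.shape[0] // 2
    padded = np.pad(desc, ((pad, pad), (pad, pad), (0, 0)))   # zero padding keeps H x W
    # Every 7 x 7 neighbourhood of the padded descriptors: shape (H, W, 2, 7, 7).
    windows = sliding_window_view(padded, kernel.shape[:2], axis=(0, 1))
    logits = np.einsum("hwckl,klc->hw", windows, kernel)

    weights = 1.0 / (1.0 + np.exp(-logits))          # sigmoid (assumed activation)
    return weights[..., None]


rng = np.random.default_rng(0)
f1 = rng.standard_normal((16, 16, 8))
kernel = rng.standard_normal((7, 7, 2)) * 0.1        # stand-in for learned weights

w1 = spatial_attention(f1, kernel)                   # spatial weight map W1
k1 = w1 * f1                                         # K1: W1 applied at every position
```

The weight map has a single channel, so the same spatial weight is broadcast across all channels of the feature map.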
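Next, the two feature maps $F1 \in \mathbb{R}^{H \times W \times C}$ and $F2 \in \mathbb{R}^{H \times W \times C}$ are concatenated to generate the feature map $F \in \mathbb{R}^{H \times W \times 2C}$:

$$F = \mathrm{Concat}(F1, F2)$$

where Concat denotes the concatenation operation. The feature map $F$ is divided into $n$ parts along the channel dimension, $[F^{1}, F^{2}, \cdots, F^{n}]$, in which $F^{x} \in \mathbb{R}^{1 \times H \times W}$, $x \in \{1, 2, \cdots, n\}$, $n = 2C$. That is, the input feature map is split into $2C$ parts, and a 2-D DCT is applied to each channel. This captures multiple frequency components, including the lowest frequency component, and thus retains more information. In addition, FcaNet has demonstrated that GAP is proportional to the lowest frequency component of the 2DDCT. The forward and inverse two-dimensional discrete cosine transforms are shown in Eq.5 and Eq.6, respectively:

$$D(u, v) = a(u)\,a(v) \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} f(i, j) \cos\left[\frac{(2i+1)u\pi}{2H}\right] \cos\left[\frac{(2j+1)v\pi}{2W}\right] \tag{5}$$

$$f(i, j) = \sum_{u=0}^{H-1} \sum_{v=0}^{W-1} a(u)\,a(v)\, D(u, v) \cos\left[\frac{(2i+1)u\pi}{2H}\right] \cos\left[\frac{(2j+1)v\pi}{2W}\right] \tag{6}$$

where $D(u, v) \in \mathbb{R}^{H \times W}$ is the 2-D DCT frequency spectrum, $f(i, j) \in \mathbb{R}^{H \times W}$ is the input feature, $H$ and $W$ are the height and width of $f(i, j)$, and $a(\cdot)$ is the normalization factor of the orthonormal DCT ($a(0) = \sqrt{1/N}$ and $a(k) = \sqrt{2/N}$ for $k > 0$, with $N = H$ for $a(u)$ and $N = W$ for $a(v)$). The divided feature maps are then substituted into Eq.5 to obtain $D^{x}_{freq} \in \mathbb{R}^{1 \times 1 \times 1}$, $x \in \{1, 2, \cdots, n\}$, $n = 2C$. The whole attention mask vector is obtained by concatenating these results:

$$D_{freq} = \mathrm{Concat}([D^{1}_{freq}, D^{2}_{freq}, \cdots, D^{n}_{freq}])$$

where $D_{freq} \in \mathbb{R}^{2C \times 1 \times 1}$ is the obtained mask vector. The mask vector $D_{freq}$ then passes through the FRF layer to produce the guide vector $G$:

$$G = \mathrm{FRF}(D_{freq})$$

where the FRF layer consists of two 1 × 1 convolutions $\in \mathbb{R}^{C}$ with a convolution kernel size of 5, and $W3$ and $W4$ are the parameters of the two convolution layers, respectively. The guide vector $G$ is then used to compute the attention weights. $G$ is reshaped into two guide tensors $P \in \mathbb{R}^{1 \times 1 \times C}$ and $Q \in \mathbb{R}^{1 \times 1 \times C}$. Inspired by SKNet [6], we adaptively adjust the weights of the input feature maps in the FSFM module. The guide tensors are converted into the frequency attention weight vectors $W^{A} \in \mathbb{R}^{1 \times 1 \times C}$ and $W^{B} \in \mathbb{R}^{1 \times 1 \times C}$ for K1 and K2, respectively, through an element-wise softmax operation:

$$W^{A}_{c} = \frac{e^{P_{c}}}{e^{P_{c}} + e^{Q_{c}}}, \qquad W^{B}_{c} = \frac{e^{Q_{c}}}{e^{P_{c}} + e^{Q_{c}}}$$

where $W^{A}_{c}$ is the $c$-th element of $W^{A}$, $P_{c}$ is the $c$-th element of $P$, and likewise for $W^{B}_{c}$ and $Q_{c}$, so that $W^{A}_{c} + W^{B}_{c} = 1$. The fused feature map $N \in \mathbb{R}^{H \times W \times C}$ is obtained by applying the attention weights to the two branches:

$$N_{c} = W^{A}_{c} \otimes K1_{c} \oplus W^{B}_{c} \otimes K2_{c}$$

where $N = [N_{1}, N_{2}, \ldots, N_{C}]$, $N_{c} \in \mathbb{R}^{H \times W}$, $K1_{c}$ is the $c$-th channel of K1, and likewise for $K2_{c}$; $\oplus$ indicates element-wise summation and $\otimes$ indicates element-wise multiplication.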
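The softmax weighting and the final fusion can be sketched as follows. This minimal NumPy example assumes the guide tensors P and Q are already available as vectors of length C (the FRF layer that produces them is not modelled), and the function name `softmax_fuse` is hypothetical.

```python
import numpy as np


def softmax_fuse(k1, k2, p, q):
    """Fuse k1 and k2 of shape (H, W, C) using guide vectors p and q of shape (C,)."""
    shift = np.maximum(p, q)                  # subtract the max for numerical stability
    exp_p, exp_q = np.exp(p - shift), np.exp(q - shift)
    w_a = exp_p / (exp_p + exp_q)             # weight for the K1 branch, per channel
    w_b = exp_q / (exp_p + exp_q)             # weight for the K2 branch, per channel
    fused = w_a * k1 + w_b * k2               # (C,) broadcasts over (H, W, C)
    return fused, w_a, w_b


rng = np.random.default_rng(0)
k1 = rng.standard_normal((4, 4, 3))
k2 = rng.standard_normal((4, 4, 3))
p = np.array([2.0, 0.0, -1.0])                # illustrative guide values, not learned
q = np.array([0.0, 0.0, 1.0])

fused, w_a, w_b = softmax_fuse(k1, k2, p, q)
```

Because the two weights of every channel sum to one, each output channel is a convex combination of the corresponding channels of K1 and K2; when the two guide values are equal, the channel is simply averaged.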

Network architecture
Based on the DSPM and FSFM components, we designed the architecture of the MFEAFN, as shown in Fig 1. Unlike ResNet, which increases only width or depth, EfficientNetv2-S uses a compound coefficient φ to scale the network's width, depth, and resolution together to improve its performance. Therefore, we employ EfficientNetv2-S as our backbone. The DSPM then extracts both global and multiscale context information from the 16× downsampled output of the EfficientNetv2-S backbone. Next, we take the output of the DSPM as the input of two FSFMs to enhance and adaptively fuse the multiscale context information for objects of different sizes. Previous studies have shown that spatial detail information is important for improving network performance. The low-level features have high resolution and contain rich location and detail information, whereas the high-level features contain semantic information. Because the two levels of features do not contain the same information, fusing them with a plain concatenation operation is not effective. We therefore also use FSFM to enhance the adaptive fusion of these different levels of features. Finally, a standard 3 × 3 convolution refines the features, and the final output is obtained after 4× upsampling.

Experimental configuration and evaluation metrics
The proposed method was studied experimentally. The software and hardware configuration used in the experiments is shown in Table 1.
We compared MFEAFN with DeepLabv3+ on the PASCAL VOC 2012 dataset [42] and the Cityscapes dataset [22], using mean intersection over union (mIoU), intersection over union (IoU), overall accuracy (OA), and mean pixel accuracy (mPA) as the evaluation metrics of the model. In the metric definitions, $k$ is the number of classes used in the experiments, and $p_{ii}$, $p_{ij}$, and $p_{ji}$ denote the numbers of true-positive, false-positive, and false-negative pixels, respectively. IoU measures the ratio of the intersection of a category's predicted and actual pixels to their union. mIoU is the IoU computed for each category, summed and averaged over all categories. mPA computes the proportion of correctly classified pixels for each class separately and then averages over the classes. IoU, mIoU, and mPA are all standard metrics for measuring model performance in semantic segmentation.
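For reference, the four metrics can be computed from a confusion matrix as in the sketch below. The layout `conf[i, j]` (pixels of true class i predicted as class j), the function name `segmentation_metrics`, and the toy matrix are assumptions made for this illustration; the sketch also assumes every class occurs at least once, so no division by zero arises.

```python
import numpy as np


def segmentation_metrics(conf: np.ndarray) -> dict:
    """Compute IoU, mIoU, OA and mPA from a square confusion matrix.

    conf[i, j] is the number of pixels of true class i predicted as class j.
    """
    conf = conf.astype(np.float64)
    tp = np.diag(conf)                    # correctly classified pixels per class
    gt_total = conf.sum(axis=1)           # true positives + false negatives
    pred_total = conf.sum(axis=0)         # true positives + false positives
    iou = tp / (gt_total + pred_total - tp)
    return {
        "IoU": iou,                       # per-class intersection over union
        "mIoU": iou.mean(),               # IoU averaged over classes
        "OA": tp.sum() / conf.sum(),      # overall pixel accuracy
        "mPA": (tp / gt_total).mean(),    # per-class accuracy averaged over classes
    }


# Toy two-class example: rows are ground truth, columns are predictions.
toy_conf = np.array([[3, 1],
                     [2, 4]])
metrics = segmentation_metrics(toy_conf)
```

In practice the confusion matrix is accumulated over the whole validation set before the metrics are computed, so that mIoU is not an average of per-image scores.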

Datasets and implementation details
Pascal VOC 2012 and Cityscapes. The original PASCAL VOC 2012 dataset [43] contains 1,464 (train), 1,449 (val), and 1,456 (test) pixel-level annotated images covering 20 foreground object classes and one background class. We augment the dataset with the extra annotations provided by [42], resulting in 10,582 (trainaug) training images. In our experiments, we use the "poly" strategy [44] as the learning rate schedule and SGD as the optimizer, with an initial learning rate of 0.01, weight decay of 0.0005, momentum of 0.9, batch size of 16, crop size of 512 × 512, and 50 epochs. The Cityscapes dataset contains street scenes from 50 different cities, with high-quality pixel-level annotations for 5,000 frames (2,975, 500, and 1,525 for the training, validation, and test sets, respectively) and 20,000 coarsely labelled images. For Cityscapes, the initial learning rate is set to 0.1, the batch size to 8, the crop size to 768 × 768, and the number of epochs to 80.
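The "poly" learning rate strategy is commonly written as the base rate multiplied by (1 − step / max_steps) raised to a power. The sketch below assumes a power of 0.9, a value often used in segmentation work but not stated in the text; the function name `poly_lr` and the step counts are illustrative.

```python
def poly_lr(base_lr: float, step: int, max_steps: int, power: float = 0.9) -> float:
    """Polynomial decay: base_lr * (1 - step / max_steps) ** power."""
    return base_lr * (1.0 - step / max_steps) ** power


# Learning rate at the start, the midpoint, and the end of a 100-step run,
# using the PASCAL VOC initial rate of 0.01 given above.
schedule = [poly_lr(0.01, s, 100) for s in (0, 50, 100)]
print(schedule)
```

The rate starts at the initial value and decays smoothly to zero at the last step, so no separate decay milestones have to be chosen.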

Ablation study on PASCAL VOC2012 val set
Ablation study of DSPM on PASCAL VOC2012 val set. In Table 2, the first line uses the ASPP from the original paper [23], and the second line adds a 3 × 3 dilated convolution with an atrous rate of 24 to the ASPP module to capture long-range dependencies. Experimental results show that this degrades the network's performance by 0.11% compared to the original ASPP. To further improve the network's performance, we tried replacing the 1 × 1 convolutional layer of ASPP with a 3 × 3 depthwise separable convolutional layer. As seen from the first and fourth lines of Table 2, MASPP slightly improves on ASPP.
Ablation study of FSFM on PASCAL VOC2012 val set. To verify the effectiveness of the proposed feature fusion module FSFM, we conducted the ablation experiments shown in Table 3. We used ResNet101 and DSPM as the baseline and compared the four feature fusion modules in Fig 4. As shown in Table 3, the commonly used concatenation is 0.58% mIoU higher than the baseline network, which shows that fusing features by concatenation can effectively improve the performance of the network. Comparing line 3 with line 2 of Table 3, replacing the concatenation operation with SFM further improves the mIoU from 79.08% to 79.35%, which demonstrates the effectiveness of SFM: it can adaptively fuse the required features at different scales. From line 3 to line 4, adding the spatial attention mechanism to both branches of SFM separately (SSFM) is 0.3% mIoU higher than SFM, because SSFM not only adaptively selects essential information in the channel dimension but also emphasizes or suppresses information in the spatial dimension. FSFM, designed by replacing the global average pooling (GAP) in SSFM with the 2DDCT, obtains 80.36% mIoU, the best performance among the feature fusion methods compared, because the 2DDCT provides more information than GAP alone (GAP being one of its components) when the attention mechanism adaptively selects the information to focus on.
To get a more intuitive understanding of the function of our proposed feature fusion module, we visualized some images from the PASCAL VOC2012 dataset, as shown in Fig 5. Visually, our method focuses well not only on single or multiple objects of one category in the picture but also on objects of different categories.
In Fig 7, the masks overlaid on the original images are shown in the second column, and the third column shows the prediction maps output by MFEAFN. The second column in Fig 7 shows that the MFEAFN mask does not cover the original image well for small distant objects such as "boats" and "birds," so there is room for further improvement of our proposed method on small distant objects. In addition, we compared the MFEAFN proposed in this paper with other segmentation networks on the PASCAL VOC2012 validation dataset. The comparison results are shown in Table 4.
In the first row of Fig 8, the bicycle segmented by MFEAFN is more complete than that of the DeepLabv3+ network, and the boundary of the car is predicted more carefully. In the second row of Fig 8, DeepLabv3+ incorrectly classifies the pixels in the detail area of the "building" as "pole", which causes the segmentation to lose detailed information on the "sidewalk" in the lower right corner. In the third row of Fig 8, DeepLabv3+ incorrectly predicts the "terrain" category as the "sidewalk" category, causing class confusion and missing edge details. The fifth column in Fig 7 is more continuous in its segmentation of objects than the fourth; for example, the segmentation of "traffic light" and "pole" is more contiguous. In addition, we also compare the MFEAFN proposed in this paper with other segmentation networks on the Cityscapes validation dataset. The comparison results are shown in Table 5.

Conclusion
We designed a Double Spatial Pyramid Module (DSPM) to extract objects of different sizes in the same category more efficiently. In addition, to better fuse features of different scales or levels, we built the Focusing Selective Fusion Module (FSFM), which enhances the adaptive fusion of these features by generating spatial and frequency correlation weight mappings for each feature map. Based on the DSPM and FSFM modules, we propose a multiscale feature-enhanced adaptive fusion network (MFEAFN) that effectively addresses the problems of local information loss and class confusion. Experimental results on the PASCAL VOC 2012 and Cityscapes datasets show that MFEAFN has better segmentation performance than state-of-the-art methods.